Skip to content

Conversation

@samxbr
Copy link
Contributor

@samxbr samxbr commented Jan 31, 2025

Currently the SLM health indicator in health report turns YELLOW when snapshots fail for a number of times. However, the SLM health indicator stays GREEN if snapshot is not completed (no success or failure) for a long time. This change adds a new optional setting unhealthy_if_no_snapshot_within to SLM policy, that sets a time threshold. If the SLM policy has not had a successful snapshot for longer than the threshold, the SLM health indicator will turn YELLOW.

@samxbr samxbr marked this pull request as draft January 31, 2025 08:12
@samxbr samxbr changed the title Improve SLM Health Indicator to cover for missing successful snapshot for a long time Improve SLM Health Indicator to cover missing snapshot Feb 3, 2025
Comment on lines 383 to 392
if (policy.getLastSuccess() != null) {
// prefer snapshotStartTimestamp over snapshotFinishTimestamp in case of a very long-running snapshot
// that started a long time ago
SnapshotInvocationRecord lastSuccess = policy.getLastSuccess();
return lastSuccess.getSnapshotStartTimestamp() != null
? lastSuccess.getSnapshotStartTimestamp()
: lastSuccess.getSnapshotFinishTimestamp();
}
// SX TODO: handle first snapshot (i.e. no prior success of failure), maybe record the policy first trigger timestamp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't yet handle the case for the first snapshot (i.e. no prior successful snapshot). To handle that, I am thinking to record the first SLM trigger time in SLM metadata SnapshotLifecyclePolicyMetadata so here we can check against that. I can do that in a follow up PR if the idea makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to handle that particular case, so +1 to investigate it after this PR

@samxbr samxbr added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Feb 4, 2025
@samxbr samxbr requested a review from dakrone February 4, 2025 15:32
@samxbr samxbr marked this pull request as ready for review February 4, 2025 15:33
@samxbr samxbr requested a review from a team as a code owner February 4, 2025 15:33
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Feb 4, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @samxbr, I've created a changelog YAML for you.

@samxbr
Copy link
Contributor Author

samxbr commented Feb 4, 2025

Due to current doc freeze, the doc change for this PR is not added yet. It will likely be in a separate PR after the doc freeze.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, but this generally looks good! What do you think of the naming and validation questions?

"invalid missingSnapshotUnhealthyThreshold ["
+ missingSnapshotUnhealthyThreshold.getStringRep()
+ "]: "
+ "time is too short, expecting at least more than the interval between snapshots ["
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence about this validation, on one hand, should we actually prevent someone from making the threshold too small, given that they can always manually execute the SLM policy to take a snapshot? I'm not sure it's worth it. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this validation is not strictly necessary, the worse case without it is that the indicator will always be YELLOW if the threshold is set too small.

I am slightly leaning towards keeping it, my philosophy here is to providing quick feedback for user over letting them find out through health indicator. I can't think of a use case where user would set this threshold smaller than the snapshot interval. I also can't think of a drawback for keeping this validation, it's ok if user occasionally execute SLM manually, it just resets the threshold time, and doesn't defeat the purpose of this setting. We can definitely remove this validation if it can have potential negative impact I didn't think of.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards keeping the validation as well. I also don't really see a reason why we would allow users to do this, but I do see value in avoiding support tickets where users to do this. Plus, we get unhealthyIfNoSnapshotWithin > 0 validation for free.

Comment on lines 383 to 392
if (policy.getLastSuccess() != null) {
// prefer snapshotStartTimestamp over snapshotFinishTimestamp in case of a very long-running snapshot
// that started a long time ago
SnapshotInvocationRecord lastSuccess = policy.getLastSuccess();
return lastSuccess.getSnapshotStartTimestamp() != null
? lastSuccess.getSnapshotStartTimestamp()
: lastSuccess.getSnapshotFinishTimestamp();
}
// SX TODO: handle first snapshot (i.e. no prior success of failure), maybe record the policy first trigger timestamp

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to handle that particular case, so +1 to investigate it after this PR

? lastSuccess.getSnapshotStartTimestamp()
: lastSuccess.getSnapshotFinishTimestamp();
}
// SX TODO: handle first snapshot (i.e. no prior success of failure), maybe record the policy first trigger timestamp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can change this to just a regular TODO instead of one specific to you :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget about this one :)

@samxbr
Copy link
Contributor Author

samxbr commented Feb 10, 2025

Hi @nielsbauman, I added you as a backup reviewer in case @leehinman may not have time to review this week 😄

Copy link
Contributor

@nielsbauman nielsbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a look at most code, I only have the two large test files left to check, which I'll do later today. Looking great so far!

Comment on lines -1462 to +1463
* Runs the code block for the provided interval, waiting for no assertions to trip.
* Runs the code block for the provided interval, waiting for no assertions to trip. Retries on AssertionError
* with exponential backoff until provided time runs out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change really necessary? I don't really feel like this change is super valuable (it's not wrong, just not absolutely necessary) and since it's the only change in this file, I'd prefer to revert it to reduce the scope of the PR a bit.

Copy link
Contributor Author

@samxbr samxbr Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding the comment makes it clearer that it will retry until timeout, I wasn't sure about whether it will retry or not from the original comment, and had to read the code. Maybe it's just me though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah that's why I said you're not wrong. I'm just not a huge fan of PRs changing unrelated files. I prefer PRs to have a more defined scope to make them easier to read and for commit history to be a bit more understandable.

Copy link
Contributor Author

@samxbr samxbr Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get what you mean, and I would usually agree with you for changes that are less trivial than this. It's pretty common for people to find opportunities for tiny improvements while working on unrelated code. I think it's adding unnecessary friction if we were to split every tiny improvements to separate PRs (like a PR just to change this line of comment). To me it's more important to continuously make tiny improvements to the code base than trying to keep every PR within its scope.

That being said, I agree with you 100% on keeping PR in scope. It's just this one feels too trivial to be "out-of-scope".

"invalid missingSnapshotUnhealthyThreshold ["
+ missingSnapshotUnhealthyThreshold.getStringRep()
+ "]: "
+ "time is too short, expecting at least more than the interval between snapshots ["
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards keeping the validation as well. I also don't really see a reason why we would allow users to do this, but I do see value in avoiding support tickets where users to do this. Plus, we get unhealthyIfNoSnapshotWithin > 0 validation for free.

Copy link
Contributor

@nielsbauman nielsbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I hadn't come around to checking the test classes yet. I've reviewed everything now, so we're almost good to go from my end :)

Comment on lines -1462 to +1463
* Runs the code block for the provided interval, waiting for no assertions to trip.
* Runs the code block for the provided interval, waiting for no assertions to trip. Retries on AssertionError
* with exponential backoff until provided time runs out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah that's why I said you're not wrong. I'm just not a huge fan of PRs changing unrelated files. I prefer PRs to have a more defined scope to make them easier to read and for commit history to be a bit more understandable.

@samxbr samxbr requested a review from nielsbauman February 13, 2025 08:45
Copy link
Contributor

@nielsbauman nielsbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for iterating on this, @samxbr!

@samxbr samxbr merged commit 5d48ded into elastic:main Feb 14, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement Team:Data Management Meta label for data/management team v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants